Introduction to Open Data Science - Course Project

About the project


# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Tue Nov 29 02:09:44 2022"

The text continues here.

My thoughts:

I saw an email advertisement for this course and was very keen to join. I would like to learn more about data analytics; I have already studied some data analytics using Python and Jupyter Notebook, so studying R sounded very interesting. A couple of years ago I also took part in another course organized by Kimmo Vehkalahti (Johdatus Yhteiskuntatilastotieteeseen), and I must say it was all thanks to him that I finally started to understand something about statistics. That course only covers the basics, so I am hoping to learn more about statistics here. I am really looking forward to this course.

Reflecting on my learning experience

I think using Exercise 1 together with the book makes it easier to go through the topics. I first tried to read just the book, but it gets a bit tedious; going through it together with the exercise makes it more interactive. The syntax in R is somewhat similar to other programming languages, but it also has many differences, so currently I am a bit worried about whether I will remember all those functions and commands. Luckily R for Health Data Science seems to be quite easy to browse, so it will not be a big problem to check some of the functions or syntax again later.

I have just done the first four parts for now, and they cover the basics, so there was nothing too difficult yet. I always struggle with join functions, and I probably will again, but R seems quite helpful here too, as you can see the results of the joined tables very easily.
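Since joins keep tripping me up, here is a minimal sketch of how dplyr's join functions behave (made-up toy tables, not course data):

```r
# Toy tables for illustration (hypothetical, not course data)
library(dplyr)

students <- data.frame(id = c(1, 2, 3), name = c("Anna", "Ben", "Cara"))
scores   <- data.frame(id = c(2, 3, 4), points = c(25, 31, 18))

# inner_join keeps only rows whose `id` appears in both tables
inner_join(students, scores, by = "id")
#   id name points
# 1  2  Ben     25
# 2  3 Cara     31

# left_join keeps every student, filling missing scores with NA
left_join(students, scores, by = "id")
```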

GitHub repository:

https://github.com/Ankinen/IODS-project


Analysing learning2014 data using multiple regression


date()
## [1] "Tue Nov 29 02:09:44 2022"

Read the learning2014 data and explore the structure and dimensions of it.

learning2014 <- read.table("data/learning2014.csv", sep=",", header=TRUE)
dim(learning2014)
## [1] 166   7
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

Dimensions and structure of the data

There are 166 observations and 7 variables in this data set. The variables are ‘gender’, ‘age’, ‘attitude’, ‘deep’, ‘stra’, ‘surf’, and ‘points’. The ‘deep’, ‘stra’, and ‘surf’ variables refer to students’ learning approaches: deep, strategic, and surface learning.

We now use this data to explain the relationship between points (the dependent variable, or y) and age, attitude, deep, stra and surf (the explanatory variables). To do this we use multivariable (or multiple) regression, a form of linear regression.

Graphical overview

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

#graphical overview
plotmatrix  <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
plotmatrix

Summaries of the variables

summary(learning2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Linear Regression - explaining the plot and summaries

In linear regression we want the observed data to fit a descending or ascending line, because that would suggest there most likely is some kind of correlation. When the line fits well, we can see a linear relationship between the explanatory and dependent variables. If a straight line does not fit well, we need to consider an alternative model. (We will see some regression lines using the learning2014 data later.)

The residuals should also be normally distributed, with a mean of zero. If we look at the histograms above, we can see that most of the variables are not normally distributed, but it is still too early to draw any definite conclusions about the data.
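As a quick sketch of what checking residual normality can look like in practice (using R's built-in cars data as a stand-in, since the learning2014 model comes later):

```r
# Fit a simple model on R's built-in cars data (a stand-in example)
fit <- lm(dist ~ speed, data = cars)

# The residuals of a least-squares fit always average to zero;
# the question is whether their distribution looks normal
summary(resid(fit))

# A histogram and the Shapiro-Wilk test give quick normality checks
hist(resid(fit), main = "Residuals of dist ~ speed")
shapiro.test(resid(fit))
```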

Here we just take a first glimpse of the data to see what it all looks like.

The third assumption that we have in linear modelling is that the residuals are independent. This is more of a concern when doing time series data analysis.

The last assumption is that the residuals have equal variance. This means that the spread of the observations around the fitted line should be roughly the same across the whole range of fitted values.

From the plot matrix we can observe that there are some possible outliers and that gender does not seem to play a big role as both green and red dots are scattered more or less evenly.

From the summary table we can see the median and mean values of each variable, as well as their minimum and maximum values and the values at the 25% and 75% marks.

Let’s further check the relationship between points and three explanatory variables, attitude, stra and surf.

Regression model for points ~ attitude + stra + surf

# Fit a regression model where `points` is the target variable and `attitude`, `stra` and `surf` are the explanatory variables.
linearmodel <- lm(points ~ attitude + stra + surf, data = learning2014)

# print out the summary of the new model
summary(linearmodel)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Call:

Here we can first see the function call, where points refers to the dependent variable and the variables after the ~ are the explanatory variables.

Residuals:

Next are the residuals. They tell the difference between the fitted line and the observations; here the median residual is 0.52. If our model fits well, the residual median should be close to zero, and Q1 and Q3 should be close to each other in magnitude (here they are quite near). The maximum and minimum values should also be near each other in magnitude. However, in the case of the maximum and minimum values, there might be some outliers affecting the results. As we can see from the residual boxplot, this indeed seems to be the case.

#residual boxplot

boxplot(resid(linearmodel))

Coefficients:

Estimates

The intercept represents the mean value of the expected response when all the predictor (explanatory) variables are equal to zero. The intercept is not always a sensible number, as sometimes a variable cannot have a value of zero. In this case it does make some sense, because it is possible for the points to be zero. For the other features (the explanatory variables), the estimates give the estimated change in points when the variable changes by one unit.

Standard Error The standard error, as the name suggests, is the standard error of our estimate. This allows us to construct marginal confidence intervals.
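For example, confint() computes such intervals directly from the estimates and their standard errors; a small sketch on the built-in cars data (a stand-in, not the learning2014 model):

```r
# Confidence intervals for the coefficients (built-in cars data as a stand-in)
fit <- lm(dist ~ speed, data = cars)

confint(fit)                # default 95% intervals
confint(fit, level = 0.99)  # wider 99% intervals

# Roughly, the 95% interval is the estimate plus/minus ~2 standard errors
coef(summary(fit))[, "Estimate"] - 2 * coef(summary(fit))[, "Std. Error"]
coef(summary(fit))[, "Estimate"] + 2 * coef(summary(fit))[, "Std. Error"]
```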

t-value t-value tells us how far our estimated parameter is from a hypothesized 0 value, scaled by the standard deviation of the estimate.

P-value The p-value for each variable tests the null hypothesis that the coefficient is equal to zero (i.e., no effect). If this probability is low enough (typically 5% or less, preferably well under), we can reject the null hypothesis. Here R helps us a bit and marks statistically significant p-values with asterisks. But remember not to trust this value blindly!

The last part of the summary

The last part of the summary is for assessing the fit and overall significance of our model.

Residual Standard Error The residual standard error gives the standard deviation of the residuals.

Multiple R-squared

R-squared tells the proportion of the variance in the dependent variable that can be explained by the explanatory variables; it shows how well the data fit our regression model. It does not give any information about causation between the dependent and explanatory variables, nor does it indicate the correctness of the model. The higher the number, the more variability is explained. With a multiple R-squared of about 0.21, our model clearly does not explain the dependent variable very well. This number is also something to be cautious about, though: a high score can also indicate that our model is overfitting the data.
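These figures can also be pulled out of the summary object programmatically, which is handy when comparing models; a sketch with the built-in cars data as a stand-in:

```r
# Extracting fit statistics from summary() (built-in cars data as a stand-in)
fit <- lm(dist ~ speed, data = cars)
s   <- summary(fit)

s$r.squared      # multiple R-squared
s$adj.r.squared  # adjusted R-squared, penalised for extra predictors
s$sigma          # residual standard error
s$fstatistic     # F statistic and its degrees of freedom
```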

F-Statistic The F-test tells us whether the group of explanatory variables is jointly significant.

Removing variables that are not statistically significant

The previous summary showed us that only attitude had a significant relationship with the dependent variable, so the following model uses attitude alone.

# Fit a regression model where `points` is the target variable and `attitude` is the only explanatory variable.
linearmodelfitted <- lm(points ~ attitude, data = learning2014)

# print out the summary of the new model
summary(linearmodelfitted)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Compare the results with the previous model

We can see that there is not much change, and that the multiple R-squared value is actually lower. Based on this, the data does not seem to fit this simpler model any better.

Diagnostic Plots

The plot function offers (at least) six different diagnostic plots. Here we have chosen number 1 (Residuals vs Fitted values), number 2 (Normal Q-Q plot) and number 5 (Residuals vs Leverage).

which  graphic
1      Residuals vs Fitted values
2      Normal Q-Q plot
3      Scale-Location (standardized residuals vs fitted values)
4      Cook’s distances
5      Residuals vs Leverage
6      Cook’s distance vs Leverage

Residuals vs Fitted values When this scatterplot shows a good fit for the linear model, the points appear randomly spread around the line, with no apparent non-linear trends or indications of non-constant variance. The red line, which shows the average value of the residuals at each fitted value, is then roughly flat, telling us that there is no discernible non-linear trend in the residuals. The residuals should also be equally variable across the entire range of fitted values.

In the last model, where we only have attitude as the explanatory variable, the Residuals vs Fitted values plot at first sight seems quite OK. However, the red line starts to go up somewhere around a fitted value of 24, which indicates some non-constant variance in our errors. The points are also spread fairly evenly from around 20 to 26, but below 20 and above 26 the spread is less even. This means that we should be wary of the confidence intervals, prediction bands and p-values in our regression.

Normal Q-Q plot The Q-Q plot helps us assess whether the residuals are normally distributed. When the residuals match the diagonal line perfectly, they are normally distributed.

In our model we can see that the lower tail is heavier, i.e., the smallest residuals are more extreme (larger in magnitude) than we would expect under the standard modeling assumptions. The upper tail, on the other hand, is lighter, indicating that the largest residuals are smaller than we would expect. This means that our residuals are not normally distributed.

Residuals vs Leverage One way of defining outliers is that they are points that are not approximated well by the model, meaning they have a large residual. Such points can significantly influence the model fit. The Residuals vs Leverage plot is one way to get an estimate of the outliers.

In a perfectly fitting model, the Cook’s distance curves do not even appear on the plot, and none of the points come close to having both a high residual and high leverage.

In our model, we have a few points with a very high residual and some with very high leverage.

It looks to me like there might be some outliers in the data that should be taken into consideration (deleted from the dataset?) to make the model fit the data better. The first scatterplots in the plot matrix also indicated the possibility of outliers, so this result of the analysis is no surprise.

Our model definitely needs some further adjustment (at least) to be able to explain the dependent variable with the explanatory variables. However, the plots do not show anything so bad that it would directly indicate our explanatory variables are completely useless either.

par(mfrow = c(2,2))
plot(linearmodelfitted, which = c(1,2,5))
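Cook’s distances can also be extracted directly, so suspect observations can be listed rather than only eyeballed. A sketch using the built-in mtcars data as a stand-in (the 4/n cutoff is just one common rule of thumb, not the only choice):

```r
# Flagging influential observations via Cook's distance (mtcars as a stand-in)
fit <- lm(mpg ~ wt, data = mtcars)
cd  <- cooks.distance(fit)

# A common rule of thumb flags observations with distance > 4/n
influential <- which(cd > 4 / nrow(mtcars))
cd[influential]
```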

Additional sources:

https://boostedml.com/2019/06/linear-regression-in-r-interpreting-summarylm.html
https://dss.princeton.edu/online_help/analysis/interpreting_regression.htm
https://blog.minitab.com/en/adventures-in-statistics-2/how-to-interpret-regression-analysis-results-p-values-and-coefficients
https://www.statology.org/intercept-in-regression/#:~:text=The%20intercept%20(sometimes%20called%20the,model%20are%20equal%20to%20zero.


(more chapters to be added similarly as we proceed with the course!)

Logistic regression

date()
## [1] "Tue Nov 29 02:09:53 2022"

Bring in the libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(readr)
library(finalfit)
library(GGally)
library(boot)

Read the table to dataframe alc

alc <- read.table("data/alc.csv", sep=",", header=TRUE)
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
dim(alc)
## [1] 370  35
str(alc)
## 'data.frame':    370 obs. of  35 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

These data were collected using school reports and questionnaires in order to study student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features. The alc data combines two datasets: Mathematics (mat) and Portuguese language (por). The two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1, because G3 is the final-year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. (Source: https://archive.ics.uci.edu/ml/datasets/Student+Performance)

Analysis

The hypotheses:

  • h0 There is no relationship between the students’ alcohol consumption and absences, sex, family support and mother’s education level.
  • h1 Alcohol consumption is higher among students who have a high number of absences.
  • h2 Alcohol consumption is higher among male than female students.
  • h3 Alcohol consumption is higher among male students whose mothers’ education level is low than among female students whose mothers’ education level is low.
  • h4 Alcohol consumption is higher among male students who do not receive family support than among female students who do not receive family support.
  • h5 Alcohol consumption is higher among male students with a high number of absences than among female students with a high number of absences.
  • h6 Alcohol consumption is higher among students who have a high number of absences, are male, have no family support and whose mothers’ education level is low.

Check the summary statistics of the dataset grouped by sex

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use count mean_grade
##   <chr> <lgl>    <int>      <dbl>
## 1 F     FALSE      154       11.4
## 2 F     TRUE        41       11.8
## 3 M     FALSE      105       12.3
## 4 M     TRUE        70       10.3

Here we can see that there are 154 female students whose alcohol consumption is not high (i.e. average workday and weekend alcohol consumption is 2 or less). Their mean final grade was 11.4. There were 41 females whose alcohol consumption is defined as high; their mean final grade was 11.8. Among male students, 105 had low alcohol consumption, with a mean final grade of 12.3, and the 70 male students with high alcohol consumption had a mean final grade of 10.3.

From this, it seems quite obvious that, at least for male students, alcohol consumption is very likely to affect school success. For females it seems to be the opposite, but the difference is much smaller, so the relationship does not seem as obvious.

Let’s check what the boxplot looks like:

# initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))

# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("grade") + ggtitle("Student final grade distribution by alcohol consumption and sex")

From the boxplot we can see that for females, the variance in final grade is larger among students with low alcohol consumption, and even though their mean is a bit lower than that of the females defined as having high alcohol consumption, their overall grades reach higher. This might explain the slightly higher mean among females with high alcohol consumption.

We’ll take another variable, absences. First we check the numbers of students by alcohol use and their mean absences, and then see what the distribution looks like by alcohol consumption and sex:

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absence = mean(absences))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use count mean_absence
##   <chr> <lgl>    <int>        <dbl>
## 1 F     FALSE      154         4.25
## 2 F     TRUE        41         6.85
## 3 M     FALSE      105         2.91
## 4 M     TRUE        70         6.1
# initialize a plot of high_use and absences
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and sex")

We can see that the numbers look quite similar to the means of the final grades. Both females and males defined as high in their alcohol consumption also have a higher mean number of school absences. This would seem to support our hypothesis h1, “the alcohol consumption is higher with students who have high number of absences.”

Just out of curiosity, let’s take two more variables, Medu and famsup. Medu is the mother’s education level, a numeric value ranging from 0 to 4, where 0 is no education, 1 is primary education (4th grade), 2 is 5th to 9th grade, 3 is secondary education and 4 is higher education. Famsup is family support; it has two values, yes and no, depending on whether the student receives educational support from the family.

Let’s see the corresponding summaries first:

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_medu = mean(Medu))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use count mean_medu
##   <chr> <lgl>    <int>     <dbl>
## 1 F     FALSE      154      2.69
## 2 F     TRUE        41      2.76
## 3 M     FALSE      105      2.96
## 4 M     TRUE        70      2.81
alc %>% group_by(sex, high_use, famsup) %>% summarise(count = n())
## `summarise()` has grouped output by 'sex', 'high_use'. You can override using
## the `.groups` argument.
## # A tibble: 8 × 4
## # Groups:   sex, high_use [4]
##   sex   high_use famsup count
##   <chr> <lgl>    <chr>  <int>
## 1 F     FALSE    no        47
## 2 F     FALSE    yes      107
## 3 F     TRUE     no        11
## 4 F     TRUE     yes       30
## 5 M     FALSE    no        47
## 6 M     FALSE    yes       58
## 7 M     TRUE     no        34
## 8 M     TRUE     yes       36

From these numbers it seems that there are no big differences between the students whose alcohol consumption is high and those whose alcohol consumption is low. But let’s see the boxplot for mother’s education level. Because family support is a binary variable, a boxplot of it would not be very informative, so we skip that.

# initialize a plot of high_use and Medu
g3 <- ggplot(alc, aes(x = high_use, y = Medu, col = sex))

# define the plot as a boxplot and draw it
g3 + geom_boxplot() + ylab("Mother's education level") + ggtitle("Mother's education level by students alcohol consumption and sex")

We can also draw a boxplot to see the relationship between alcohol consumption, family support and absences, so let’s do that. This boxplot is not directly comparable with the previous tables, but we can see that there is some relationship between these three variables too.

# initialize a plot of high_use and absences
g4 <- ggplot(alc, aes(x = high_use, y = absences, col = famsup))

# define the plot as a boxplot and draw it
g4 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and family support")

From the boxplots we can already see that our hypothesis h3 (the alcohol consumption is higher among male students whose mothers’ education level is low than female students whose mothers’ education level is low) is not true. There is no significant difference between the male students whose alcohol consumption is high and the female students with high alcohol consumption. In fact, there is no difference between the male and female students whose alcohol consumption is low either. So it seems, as was noted earlier, that mothers’ education level does not have any relationship with the students’ alcohol consumption.

There is some evidence supporting hypotheses h4 and h5, i.e., “the alcohol consumption is higher among male students who do not receive family support than female students who do not receive family support” and “the alcohol consumption is higher among male students with high absences number than female students with high absences number”, but it is not clear how significant this relationship is.

Finally, let’s summarise. It appears that students whose alcohol use is defined as high have a higher number of absences and are likelier to be male. Slightly more of the high-use students lack family support, but that difference is not significant. And lastly, as we have already discovered, there is basically no relationship between mothers’ education level and alcohol use. I would normally drop the mother’s education level from further analysis, but here I keep it just for the exercise’s sake. I would probably keep the family support variable but do further analysis to select the most parsimonious model that gives an adequate description of the data.

dependent <- "high_use"
explanatory <- c("absences", "sex", "famsup", "Medu")
alc %>% 
  summary_factorlist(dependent, explanatory, p = TRUE,
                     add_dependent_label = TRUE)
##  Dependent: high_use                FALSE      TRUE      p
##             absences Mean (SD)  3.7 (4.5) 6.4 (7.1) <0.001
##                  sex         F 154 (59.5) 41 (36.9) <0.001
##                              M 105 (40.5) 70 (63.1)       
##               famsup        no  94 (36.3) 45 (40.5)  0.512
##                            yes 165 (63.7) 66 (59.5)       
##                 Medu Mean (SD)  2.8 (1.1) 2.8 (1.1)  0.933

Logistic regression

Let’s explore the distributions of the chosen variables and their relationships with alcohol consumption. First we check that the explanatory variables, i.e., sex, absences, family support and mother’s education level, are not correlated with one another. The ggpairs() function is a good way of visualising all two-way associations.

explanatory <- c("absences", "famsup", "Medu")
alc %>% 
  remove_labels() %>%
  ggpairs(columns = explanatory, ggplot2::aes(color=sex))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can check for the presence of higher-order correlations with the variance inflation factor, which can be calculated for each of the terms. The variance inflation factor tells us how much the variance of a particular regression coefficient is increased due to the presence of multicollinearity in the model. GVIF stands for generalised variance inflation factor. According to the R for Health Data Science book, “a common rule of thumb is that if this is greater than 5-10 for any variable, then multicollinearity may exist. The model should be further explored and the terms removed or reduced”. None of the factors are greater than 5-10, so there is no suggestion of multicollinearity.

dependent <- "high_use"
explanatory <- c("absences", "sex", "Medu", "famsup")
alc %>% 
  glmmulti(dependent, explanatory) %>%
  car::vif()
## absences      sex     Medu   famsup 
## 1.032858 1.064500 1.052280 1.053462

We use glm() to fit a logistic regression model with high_use as the target variable and absences, sex, Medu, and famsup as the predictors.

# find the model with glm()
m <- glm(high_use ~ absences + sex + Medu + famsup, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + sex + Medu + famsup, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2968  -0.8763  -0.6067   1.1127   2.0637  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.585134   0.382978  -4.139 3.49e-05 ***
## absences     0.099038   0.023618   4.193 2.75e-05 ***
## sexM         1.057917   0.249816   4.235 2.29e-05 ***
## Medu        -0.102692   0.113867  -0.902    0.367    
## famsupyes   -0.006895   0.251175  -0.027    0.978    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 414.82  on 365  degrees of freedom
## AIC: 424.82
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
##  (Intercept)     absences         sexM         Medu    famsupyes 
## -1.585134370  0.099037567  1.057917219 -0.102691836 -0.006895319
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.2049203 0.09470666 0.4265125
## absences    1.1041078 1.05649976 1.1590633
## sexM        2.8803656 1.77683802 4.7406281
## Medu        0.9024050 0.72142713 1.1285336
## famsupyes   0.9931284 0.60860371 1.6321229

From the coefficients we can see that it is safe to reject the null hypothesis (h0) that there is no relationship between the students’ alcohol consumption and absences, sex, family support and mother’s education level. It is clear that the number of absences and being male have a relationship with high alcohol consumption. This also seems to confirm our hypothesis h2, “the alcohol consumption is higher among male than female students”.

We already discarded hypothesis h3, "the alcohol consumption is higher among male students whose mothers' education level is low than among female students whose mothers' education level is low", and it seems that hypothesis h4, "the alcohol consumption is higher among male students who do not receive family support than among female students who do not receive family support", is not true either. For both variables, family support and mother's education level, the odds ratio is close to 1, which suggests that these variables have no effect, or only a very limited effect, on students' alcohol consumption. This is further confirmed by the confidence intervals, as the CI of both variables is rather narrow and includes 1.

The model seems to confirm hypothesis h5, that alcohol consumption is higher among male students with a high number of absences than among female students with a high number of absences, because the coefficients show high significance for male sex and for absences. This is further supported by the odds ratios and CIs of these variables, although the odds ratio for absences is only a little over 1.

The last hypothesis h6, "the alcohol consumption is higher among students who have a high number of absences, who are male, have no family support and whose mothers' education level is low", is also not true, as family support and mother's education level do not have a significant relationship with students' alcohol consumption.
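As a quick sanity check on how the odds ratios relate to the coefficients: exponentiating a log-odds coefficient gives the odds ratio, so a coefficient of 0 corresponds to the null-effect OR of 1. The coefficient values below are copied from the model output above:

```r
# odds ratio = exp(log-odds coefficient); a coefficient of 0 gives OR = 1 (no effect)
b <- c(absences = 0.099038, sexM = 1.057917, famsupyes = -0.006895)
round(exp(b), 2)
# absences ~1.10, sexM ~2.88, famsupyes ~0.99, matching the OR column above
```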

Here is an odds ratio plot which may make the points above clearer. An odds ratio of 1 is the null effect, so if a variable's horizontal line crosses this value, the result is not statistically significant.

dependent <- "high_use"
explanatory_multi <- c("absences", "sex", "Medu", "famsup")
alc %>% 
  or_plot(dependent, explanatory_multi,
          breaks = c(0.5, 1, 2, 5, 10, 25),
          table_text_size = 3.5,
          title_text_size = 16)
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## Warning: Removed 2 rows containing missing values (geom_errorbarh).

The Predictive power of the model

Let’s check how well our model actually predicts the target variable. We use the predict() function to make predictions with a model object, and its ‘type’ argument to get the predictions as probabilities (instead of log-odds, which is the default).

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# look at the first ten observations of the data, along with the predictions
select(alc, failures, absences, sex, high_use, probability, prediction) %>% head(10)
##    failures absences sex high_use probability prediction
## 1         0        5   F    FALSE   0.1823191      FALSE
## 2         0        3   F    FALSE   0.1981958      FALSE
## 3         2        8   F     TRUE   0.2899708      FALSE
## 4         0        1   F    FALSE   0.1296836      FALSE
## 5         0        2   F    FALSE   0.1542003      FALSE
## 6         0        8   M    FALSE   0.4619290      FALSE
## 7         0        0   M    FALSE   0.3246243      FALSE
## 8         0        4   F    FALSE   0.1670547      FALSE
## 9         0        0   M    FALSE   0.3010742      FALSE
## 10        0        0   M    FALSE   0.3010742      FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   251    8
##    TRUE     84   27

The cross tabulation of the predictions versus the actual values (i.e., the confusion matrix) shows that the precision of this model was 27/35, i.e. around 77%, and the recall was 27/111, i.e. around 24%.

The F1 score of the model is TP / (TP + (FN + FP)/2), which in our case is around 37%. This means that our model does not work very well at all.
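These numbers can be computed directly from the confusion matrix above (TP = 27, FP = 8, FN = 84, TN = 251):

```r
# metrics hand-computed from the confusion matrix above
TP <- 27; FP <- 8; FN <- 84; TN <- 251
precision <- TP / (TP + FP)              # 27/35  ~ 0.77
recall    <- TP / (TP + FN)              # 27/111 ~ 0.24
f1        <- TP / (TP + (FN + FP) / 2)   # ~ 0.37; same as 2*precision*recall/(precision+recall)
round(c(precision = precision, recall = recall, F1 = f1), 2)
```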

Compared with the numbers in Exercise 3, this model is slightly worse. If we compare the model with simple guessing, randomly guessing that someone has a high alcohol consumption would be right about 50% of the time, so in terms of recall (around 24%) our model is worse than just a guess.

Simple measure of performance

Let's check the penalty function of our model; the less penalty we have, the better the model. The loss function of logistic regression shows the loss, i.e. the penalty if we always predict 0 (the first calculation) or always predict 1 (the second calculation). Here we can see that the cost of always predicting 0 is 0.3, which is not too bad, but the cost of always predicting 1 is 0.7, which on the other hand is not good at all.

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2486486
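To see what the loss function is doing, here is a tiny toy example with made-up vectors of my own: a prediction counts as wrong when it is more than 0.5 away from the true class.

```r
# same loss function as above: mean proportion of wrong predictions
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5  # TRUE when the prediction is on the wrong side
  mean(n_wrong)
}
true_class <- c(1, 0, 0, 1, 0)
pred_prob  <- c(0.9, 0.2, 0.6, 0.4, 0.1)
loss_func(true_class, pred_prob)  # 2 of 5 predictions are wrong -> 0.4
```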

Bonus: 10-Fold Cross-Validation

cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.272973

This model seemed to do slightly better than the model in Exercise 3. However, it still did not perform very well. If we want a model with better performance, there are plenty of ways to get one. It is also a good idea to spend more time selecting the variables. You can, for example, calculate the increase in R-squared for the last variable added to the model; this tells how much each variable improves the goodness of fit. (See for example https://statisticsbyjim.com/regression/identifying-important-independent-variables/.)

Clustering and Classification

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The data was originally published by Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, 81-102, 1978. The dataset contains a total of 506 cases, each with 14 attributes.

Variables in order:

  1. CRIM - per capita crime rate by town
  2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS - proportion of non-retail business acres per town.
  4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  5. NOX - nitric oxides concentration (parts per 10 million)
  6. RM - average number of rooms per dwelling
  7. AGE - proportion of owner-occupied units built prior to 1940
  8. DIS - weighted distances to five Boston employment centres
  9. RAD - index of accessibility to radial highways
  10. TAX - full-value property-tax rate per $10,000
  11. PTRATIO - pupil-teacher ratio by town
  12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT - % lower status of the population
  14. MEDV - Median value of owner-occupied homes in $1000’s
# libraries
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(GGally)
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded

Load the dataset and check how the data looks

Load the Boston dataset

# load the data
data("Boston")

Explore the Dataset

Structure:

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Dimensions:

dim(Boston)
## [1] 506  14

Summary of the dataset:

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

From the summary we can notice that many variables have quite interesting values. This is because many of the variables are either dummy variables (like chas) or depict proportions (like zn and indus). This is nothing we need to worry about, but we need to keep it in mind when analysing some of the numbers.

Graphical overview of the data

https://imjbmkz.medium.com/analyzing-boston-housing-dataset-2c7928f2a87f https://rstudio-pubs-static.s3.amazonaws.com/388596_e21196f1adf04e0ea7cd68edd9eba966.html

# plot matrix of the variables
boston_plot <- pairs(Boston[6:10])

From the plot matrix we can see that most of the variables seem to have some sort of relationship with each other. This suggests that we should consider these variables together using multivariate analysis.

Boxplot matrix

boxplot(as.data.frame(Boston))

By observing the boxplots, we can see that the box plot for tax is relatively tall. This is because the values of this variable are much bigger and the difference between the minimum and the maximum is larger, which is also why the tax box plot sits much higher than the other plots. Because of tax, the scale of the whole figure is large, so it is better to check the box plots of the other variables separately.

boxplot(as.data.frame(Boston[1:9]))

Now we can see that the variables crim and zn seem to have quite a lot of outliers. Chas is a bit of a special case because it is a binary dummy variable, taking the value 1 if the tract bounds the river and 0 if it doesn't. For zn and rad the median is much closer to the minimum value than to the maximum. Let's check the four last variables: ptratio, black, lstat and medv.

As a reminder: PTRATIO is the pupil-teacher ratio by town, B is 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town, LSTAT is the % lower status of the population, and MEDV is the median value of owner-occupied homes in $1000's.

boxplot(as.data.frame(Boston[11]))

boxplot(as.data.frame(Boston[12]))

boxplot(as.data.frame(Boston[13:14]))

From the boxplots we can see that the variables black and medv also seem to have a large number of outliers. Let's check the outliers that lie outside the interval formed by the 1st and 99th percentiles. We could do the same with the other variables where we noticed possible outliers, and then decide whether to leave them in or remove them from the dataset. Here we keep them, bearing in mind that these outliers might affect our results in the end.

lower_bound <- quantile(Boston$black, 0.01)
upper_bound <- quantile(Boston$black, 0.99)

outlier_ind <- which(Boston$black < lower_bound | Boston$black > upper_bound)

Boston[outlier_ind, ]
##         crim zn indus chas   nox    rm   age    dis rad tax ptratio black lstat
## 411 51.13580  0  18.1    0 0.597 5.757 100.0 1.4130  24 666    20.2  2.60 10.11
## 424  7.05042  0  18.1    0 0.614 6.103  85.1 2.0218  24 666    20.2  2.52 23.29
## 425  8.79212  0  18.1    0 0.584 5.565  70.6 2.0635  24 666    20.2  3.65 17.16
## 451  6.71772  0  18.1    0 0.713 6.749  92.6 2.3236  24 666    20.2  0.32 17.44
## 455  9.51363  0  18.1    0 0.713 6.728  94.1 2.4961  24 666    20.2  6.68 18.71
## 458  8.20058  0  18.1    0 0.713 5.936  80.3 2.7792  24 666    20.2  3.50 16.94
##     medv
## 411 15.0
## 424 13.4
## 425 11.7
## 451 13.4
## 455 14.9
## 458 13.5
# source: https://statsandr.com/blog/outliers-detection-in-r/
# Correlation Matrix
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)

# print the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.7)

Standardize the dataset and create a new variable, crime rate

Standardize the dataset

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables with standardized variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# check that we succeeded and we have a dataframe
class(boston_scaled)
## [1] "data.frame"

With standardization we bring all the variables onto the same scale, so that the mean is always 0 and the standard deviation 1. This is especially useful in multivariate analysis, and in particular in clustering, where variables on very different scales can distort the calculations, since the idea is to compute the distances between each pair of the n individuals (the rows of a data frame).
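To be explicit about what scale() does, it subtracts each column's mean and divides by its standard deviation; a small made-up vector shows this:

```r
# standardization by hand vs. scale()
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)  # TRUE: scale() is (x - mean) / sd per column
round(manual, 2)
```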

Categorical variable of crime rate

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

First we check what the scaled crime rate variable looks like, in order to decide into how many categories we want to divide it. It seems reasonable to use the quantiles as break points. Now that we have this categorical variable, which is better for creating clusters, we can also get rid of the original crim variable.
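The quantile-based binning works the same way on any numeric vector; a toy example of my own with eight values shows how cut() with quantile breaks gives four equal-sized categories:

```r
# quantile breaks + cut() -> four categories of equal size
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
bins <- quantile(x)  # min, 25%, 50%, 75%, max
cats <- cut(x, breaks = bins, include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
table(cats)  # two observations per category
```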

Divide the dataset to train set and test set

To have any idea of how well our classification works, we need to divide the dataset into two parts. The train set is 80% of the dataset, and we use it to train our model. The remaining 20% we use as a test set to see how well our model actually works. Because we don't want the test set to know the right answers beforehand, we remove the crime variable from the test set; we use it later to evaluate how well we managed to classify the test data with our model.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)
n
## [1] 506
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data [part of task 6]
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)
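An aside of my own: sample() draws a different random split on every run, so setting a seed would make the 80/20 split above reproducible. A tiny sketch:

```r
# same seed -> same random split
set.seed(42)
ind1 <- sample(10, size = 10 * 0.8)
set.seed(42)
ind2 <- sample(10, size = 10 * 0.8)
identical(ind1, ind2)  # TRUE
length(ind1)           # 8 rows for training, the remaining 2 for testing
```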

Use the Linear Discriminant Analysis to train our data

First we fit our data to the model. The lda function takes a formula (like in regression) as a first argument. We use the crime as a target variable and all the other variables as predictors.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2450495 0.2400990 0.2524752 0.2623762 
## 
## Group means:
##                  zn      indus        chas        nox          rm        age
## low       0.8945730 -0.9217082 -0.11325431 -0.8639562  0.41163840 -0.8551991
## med_low  -0.0295157 -0.2902341 -0.06938576 -0.5391858 -0.10603266 -0.3394565
## med_high -0.3930786  0.1843622  0.26805724  0.3517150  0.08925024  0.4116978
## high     -0.4872402  1.0170298 -0.04947434  1.0417071 -0.37947402  0.8052742
##                 dis        rad        tax     ptratio      black       lstat
## low       0.8821788 -0.6872140 -0.7478183 -0.40170757  0.3789089 -0.75008408
## med_low   0.3608409 -0.5473481 -0.4473305 -0.08422278  0.3124258 -0.12123459
## med_high -0.3492122 -0.3918744 -0.2773108 -0.17826118  0.1044374 -0.01314196
## high     -0.8524455  1.6390172  1.5146914  0.78181164 -0.8018076  0.85349514
##                  medv
## low       0.508342022
## med_low  -0.002334008
## med_high  0.141619338
## high     -0.672255774
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.11535398  0.73481239 -0.92575812
## indus   -0.01101754 -0.28370046  0.34217762
## chas    -0.07384922 -0.09764259  0.03324171
## nox      0.41063880 -0.70409335 -1.35457971
## rm      -0.11515062 -0.07423933 -0.06608964
## age      0.34021948 -0.32303811 -0.33078922
## dis     -0.03987144 -0.25037133  0.07807087
## rad      3.00194821  0.91743006 -0.23672054
## tax     -0.02157060  0.01475147  0.71073045
## ptratio  0.11358122 -0.03171892 -0.31825504
## black   -0.14255077  0.02607955  0.15747647
## lstat    0.13586082 -0.09414004  0.53057460
## medv     0.14721053 -0.30298315 -0.23097301
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9507 0.0379 0.0114
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

Make predictions with our test data

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16      11        1    0
##   med_low    0      25        4    0
##   med_high   0       4       20    0
##   high       0       0        0   21

I would say that, without doing anything other than scaling the dataset, this is not too bad. The model predicts correctly all of the highs (21/21), about 83% of the med_highs (20/24), about 86% of the med_lows (25/29) and about 57% of the lows (16/28). Better than just guessing!

Calculate the distances

Let’s use the scaled dataset to calculate the distances between the observations. First, we use the Euclidean distance, which is the straight-line distance between two data points. Second, we use the Manhattan distance, which is the sum of the absolute differences along each dimension. Manhattan distance is often preferred over the Euclidean metric when the dimensionality of the data increases.

Source: https://medium.com/@kunal_gohrani/different-types-of-distance-metrics-used-in-machine-learning-e9928c5e26c7

boston_scaled_new <- as.data.frame(scale(Boston))

# euclidean distance matrix
dist_eu <- dist(boston_scaled_new)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled_new, method = "manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
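To make the two metrics concrete, here is a hand calculation of my own on one pair of made-up points; dist() agrees with the formulas:

```r
# Euclidean vs. Manhattan distance for one pair of points
p <- c(0, 0); q <- c(3, 4)
sqrt(sum((p - q)^2))  # Euclidean: straight-line distance = 5
sum(abs(p - q))       # Manhattan: sum of absolute coordinate differences = 7
m <- rbind(p, q)
c(euclidean = as.numeric(dist(m)),
  manhattan = as.numeric(dist(m, method = "manhattan")))
```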

K-means algorithm

First we set the number of clusters to 3 and see how the clustering works, using a pairs plot as the visualization tool:

# k-means clustering
km <- kmeans(boston_scaled_new, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled_new[6:10], col = km$cluster)

Then we investigate what would be the optimal number of clusters and run the algorithm again.

set.seed(123)

# determine the number of clusters
k_max <- 10
k_max
## [1] 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <- kmeans(Boston, centers = 2)

# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)

After inspecting the results using the total within-cluster sum of squares (WCSS), we can see from the plot that the total WCSS drops radically at around 2 clusters, hence we choose 2 as the optimal number of clusters for this model. When we plot the Boston dataset with the clusters, we plot the same variables as before just to make it easier to see the differences. Comparing the plot with three clusters to the one with two clusters, the two clusters seem to form groups that overlap less than in the three-cluster plot.
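The same elbow logic shows up on simulated data (a toy example of my own): with two well-separated blobs, the total WCSS collapses when moving from one cluster to two and flattens afterwards.

```r
# elbow method on simulated data: two well-separated blobs -> elbow at k = 2
set.seed(1)
blob1 <- matrix(rnorm(40, mean = 0), ncol = 2)
blob2 <- matrix(rnorm(40, mean = 8), ncol = 2)
x <- rbind(blob1, blob2)
twcss <- sapply(1:4, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
round(twcss)  # the drop from k = 1 to k = 2 dwarfs all later drops
```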

Bonus tasks

Perform k-means on the standardized Boston data with some reasonable number of clusters (> 2)

set.seed(13)
# k-means clustering
km_boston <- kmeans(boston_scaled_new, centers = 3)

# plot the Boston dataset with the new clusters (note: use km_boston, not the earlier km)
pairs(boston_scaled_new[6:10], col = km_boston$cluster)

Perform LDA using the clusters as target classes.

All the variables in the Boston data are included in the LDA model. After fitting the data to the model, we visualize the results with a biplot.

boston_scaled_bonus <- as.data.frame(scale(Boston))
# create a categorical variable 'crime' and remove crim
bins <- quantile(boston_scaled_bonus$crim)
crime <- cut(boston_scaled_bonus$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# add the new categorical value to scaled data
boston_scaled_bonus <- data.frame(boston_scaled_bonus, crime)

boston_scaled_bonus <- dplyr::select(boston_scaled_bonus, -crim)
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = boston_scaled_bonus)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = boston_scaled_bonus)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2509881 0.2490119 0.2490119 0.2509881 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       0.96399896 -0.9118363 -0.11732512 -0.8783001  0.4440813 -0.8853416
## med_low  -0.09675198 -0.2910375 -0.02235445 -0.5655031 -0.1339316 -0.3391655
## med_high -0.38379059  0.1853483  0.19637334  0.3869312  0.1006223  0.4124508
## high     -0.48724019  1.0166933 -0.05532354  1.0554659 -0.4110343  0.8126334
##                 dis        rad        tax     ptratio      black      lstat
## low       0.8777688 -0.6897808 -0.7381779 -0.44318464  0.3787346 -0.7705123
## med_low   0.3516162 -0.5480059 -0.4770689 -0.06194383  0.3227717 -0.1325955
## med_high -0.3764265 -0.4121951 -0.3080609 -0.28336515  0.0872293  0.0127299
## high     -0.8531538  1.6424211  1.5171256  0.78577465 -0.7855072  0.8894341
##                  medv
## low       0.525956037
## med_low  -0.002531505
## med_high  0.170832244
## high     -0.692931573
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.101362020  0.71021035 -0.94401212
## indus    0.027739389 -0.25722619  0.28582296
## chas    -0.071824244 -0.05644265  0.08605965
## nox      0.368585037 -0.73079397 -1.36969686
## rm      -0.079177768 -0.07808680 -0.16397501
## age      0.251832283 -0.31611016 -0.14279184
## dis     -0.072503180 -0.26487036  0.16558054
## rad      3.251389429  0.93949138 -0.06819986
## tax      0.008751469 -0.01233004  0.56822607
## ptratio  0.122582335  0.02144323 -0.28114348
## black   -0.126522553  0.02006734  0.13121934
## lstat    0.209706055 -0.22672353  0.37709356
## medv     0.169105049 -0.39371422 -0.19149321
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9512 0.0366 0.0123
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(boston_scaled_bonus$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

From the biplot we can see that rad explains more of the high crime rate class, zn more of the low and med_low classes, and nox more of the med_high class. However, only rad is influential enough to separate the high crime class very clearly, although in this model some med_highs are still included there too.

Super-Bonus


Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.

The line matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling did not work. This page, https://stats.stackexchange.com/questions/82497/can-the-scaling-values-in-a-linear-discriminant-analysis-lda-be-used-to-plot-e, says: “data.lda <- data.frame(varnames=rownames(coef(iris.lda)), coef(iris.lda)) #coef(iris.lda) is equivalent to iris.lda$scaling”. That example uses the iris dataset, but the idea is the same, which is why I changed lda.fit$scaling to coef(lda.fit).

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% coef(lda.fit)
matrix_product <- as.data.frame(matrix_product)

# Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
# Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities? (0-3 points to compensate any loss of points from the above exercises)